feat: R&D codec bench framework — upstream sync, probes P5/P7, InferenceBackend, measurement model #189
Merged
## InferenceBackend trait (crates/thinking-engine/src/inference_backend.rs)
Runtime-switchable dispatch across all codec/inference paths. Nothing
killed — every research path coexists as a backend variant.
Two key axes documented in the trait module:
Axis 1 — full-path vs leaf-only quantization:
Full-path QJL/PolarQuant: entire row → JL sign+magnitude (~20 B/row)
Leaf-only I8 hybrid: HEEL+HIP location (6b) + i8 JLQ residual (9 B/row)
Passthrough: exact (2×n_cols B/row)
Axis 2 — reconstruction-grade vs signature-grade:
Reconstruction: SafetensorsRaw, BurnFwd, CandleFwd, HhtlF32+SlotL
Signature: RaBitQ, SpiralEncoding, CodecCascade, Base17
Hybrid: I8Hybrid (location + JLQ leaf)
7 backend structs registered in all_backends(). EncodedState enum
carries opaque per-backend state. Trait methods: encode, score,
reconstruct, bytes_per_row, shared_overhead_bytes, grade.
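A minimal sketch of what such a trait could look like, using the method names listed above. The signatures, the `Grade` enum, the single `Passthrough` variant of `EncodedState`, and the bf16 storage (chosen only to match the 2×n_cols B/row figure) are assumptions for illustration, not the contents of inference_backend.rs.

```rust
/// Axis 2 classification (names taken from the notes above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grade {
    Reconstruction,
    Signature,
    Hybrid,
}

/// Opaque per-backend state. Only one hypothetical variant is shown.
#[derive(Debug, Clone)]
pub enum EncodedState {
    /// Rows kept as bf16 bit patterns, 2 bytes per element (assumption).
    Passthrough { n_cols: usize, bits: Vec<u16> },
}

/// Hypothetical signatures; only the method names come from the notes.
pub trait InferenceBackend {
    fn encode(&self, rows: &[Vec<f32>]) -> EncodedState;
    fn score(&self, state: &EncodedState, i: usize, j: usize) -> f32;
    fn reconstruct(&self, state: &EncodedState, i: usize) -> Vec<f32>;
    fn bytes_per_row(&self, n_cols: usize) -> usize;
    fn shared_overhead_bytes(&self) -> usize;
    fn grade(&self) -> Grade;
}

pub struct Passthrough;

impl InferenceBackend for Passthrough {
    fn encode(&self, rows: &[Vec<f32>]) -> EncodedState {
        let n_cols = rows.first().map_or(0, |r| r.len());
        // Truncate each f32 to bf16 by keeping its top 16 bits.
        let bits = rows
            .iter()
            .flat_map(|r| r.iter().map(|x| (x.to_bits() >> 16) as u16))
            .collect();
        EncodedState::Passthrough { n_cols, bits }
    }

    fn score(&self, state: &EncodedState, i: usize, j: usize) -> f32 {
        // Dot product of the two reconstructed rows.
        let (a, b) = (self.reconstruct(state, i), self.reconstruct(state, j));
        a.iter().zip(&b).map(|(x, y)| x * y).sum()
    }

    fn reconstruct(&self, state: &EncodedState, i: usize) -> Vec<f32> {
        let EncodedState::Passthrough { n_cols, bits } = state;
        let n = *n_cols;
        bits[i * n..(i + 1) * n]
            .iter()
            .map(|&b| f32::from_bits((b as u32) << 16))
            .collect()
    }

    fn bytes_per_row(&self, n_cols: usize) -> usize {
        2 * n_cols
    }

    fn shared_overhead_bytes(&self) -> usize {
        0
    }

    fn grade(&self) -> Grade {
        Grade::Reconstruction
    }
}

/// Runtime-switchable registry; here it holds only the sketch backend.
pub fn all_backends() -> Vec<Box<dyn InferenceBackend>> {
    vec![Box::new(Passthrough)]
}
```

A caller would iterate `all_backends()`, call `encode` once per tensor, then compare `score` or `reconstruct` output across backends.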
## TurboQuant P5 results (run on Qwen3-TTS-0.6B k_proj [2048,1024])
CRITICAL FINDING: all 4 correction methods (direct i8, Fisher z,
QJL corrected, TurboQuant) hit rho >= 0.997 at single-layer, but
ALL collapse to rho = 0.000 by layer 5 in a 33-layer chain.
Single layer: Fisher z best (rho=0.999), all >= 0.997
Chain L=5: ALL 0.000
Drift/layer: QJL 6x lower bias than direct i8 (doesn't help)
Root cause: variance, not bias. Repeated multiplication of quantized
score matrices amplifies noise beyond recovery. QJL bias correction
is correct but irrelevant when variance dominates.
Implications:
- Path B (cascade inference through 33 layers) NOT VIABLE as
chained score multiplication
- Single-layer cascade IS viable (rho >= 0.997)
- I8 hybrid (HEEL+HIP + JLQ leaf) does f32 reconstruction, not
chained scoring — different quality model, not refuted by this
- Hybrid strategy: cascade per-layer, f32 GEMM between layers
P5 status updated in docs/CODEC_INVARIANTS_AND_EXPERIMENTS.md:
MEASURED — chain collapses, single-layer passes.
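A chain-collapse check of the kind described above could be sketched as follows: multiply the exact and the quantized score matrices layer by layer and record rho after each depth. The function names, the row-major square-matrix layout, and the use of Pearson correlation for rho are assumptions; this is not the code of turboquant_correction_probe.

```rust
/// Pearson correlation between two equal-length slices.
/// Returns 0.0 when either input is empty or has zero variance.
pub fn pearson(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let n = a.len() as f64;
    if n == 0.0 {
        return 0.0;
    }
    let ma = a.iter().map(|&x| x as f64).sum::<f64>() / n;
    let mb = b.iter().map(|&x| x as f64).sum::<f64>() / n;
    let (mut cov, mut va, mut vb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (dx, dy) = (x as f64 - ma, y as f64 - mb);
        cov += dx * dy;
        va += dx * dx;
        vb += dy * dy;
    }
    if va == 0.0 || vb == 0.0 {
        0.0
    } else {
        cov / (va * vb).sqrt()
    }
}

/// Row-major n x n matrix product.
fn matmul(a: &[f32], b: &[f32], n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; n * n];
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                out[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    out
}

/// rho between the exact and approximate chained products after each layer.
/// Entry d-1 of the result is the correlation at chain depth d.
pub fn chain_rho(exact: &[Vec<f32>], approx: &[Vec<f32>], n: usize) -> Vec<f64> {
    assert_eq!(exact.len(), approx.len());
    let mut rhos = Vec::with_capacity(exact.len());
    let mut acc: Option<(Vec<f32>, Vec<f32>)> = None;
    for (e, a) in exact.iter().zip(approx) {
        let next = match acc {
            None => (e.clone(), a.clone()),
            Some((pe, pa)) => (matmul(&pe, e, n), matmul(&pa, a, n)),
        };
        rhos.push(pearson(&next.0, &next.1));
        acc = Some(next);
    }
    rhos
}

/// First 1-based depth at which rho drops below `min_rho`, if any.
pub fn first_collapse(rhos: &[f64], min_rho: f64) -> Option<usize> {
    rhos.iter().position(|&r| r < min_rho).map(|i| i + 1)
}
```

Running `first_collapse` on the per-depth rho values gives a single number to gate on before trusting any chained-scoring result.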
https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
Codifies 7 anti-patterns (AP1-AP7) learned from PRs #176-#188 into an agent card that fires flags when the session repeats them:
- AP1: "225/225 feels like success" without gate 2 (#178)
- AP2: Projecting quality from docs instead of measuring (#177)
- AP3: Building new codec before benching existing ones (#184)
- AP4: Centroid-residual framing on near-orthogonal data (#177/#183)
- AP5: Python in the inference hot path
- AP6: Chained score multiplication without chain-collapse check (P5)
- AP7: Modifying ndarray without explicit permission (#176)
Invoked by adk-coordinator when pattern repetition is suspected, or by human directly. Output: list of fired flags, max 7 lines.
Also audited all 29 agent cards across both repos:
- All pin model: opus or model: sonnet (no hardcoded versions)
- opus → Opus 4.7 automatically, sonnet → Sonnet 4.6
- 3 ndarray agents on sonnet (l3-strategist, migration-tracker, product-engineer), intentional for speed-over-depth roles
- adk-coordinator missing Bash tool (by design, delegates)
- sentinel-qa missing Edit/Write (by design, audit-only)
No agent changes needed for Opus 4.7 compatibility: model: opus resolves correctly.
https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
## P7: PolarQuant HIP family probe (REFUTED for pure direction split)
Measured on Qwen3-TTS-0.6B k_proj [2048,1024], 256 rows:
Base17 L1 (current): 16.8% within-family NN recall (16/16 families)
PolarQuant normalized: 7.8% within-family NN recall (16/16 families)
Delta: -9.0% ← PolarQuant is WORSE
Root cause: stripping magnitude before clustering loses informative signal. For k_proj rows, magnitude variation correlates with NN structure: rows with similar magnitudes tend to be nearest neighbors. Base17 L1 already encodes a JOINT direction+magnitude opinion through the golden-step fold. Pure-direction families throw away half the coupling.
Insight: the "opinion as address" framing is correct, but the opinion must be JOINT direction+magnitude (like BF16's mantissa+exponent), not direction alone. This confirms the logarithmic-scale bgz17 philosophy: u8 encodes both axes simultaneously.
Status: P7 REFUTED for PolarQuant-only normalization on k_proj. Base17 L1 families are already sufficient for this tensor shape. May differ for other roles (gate, up, down); per-role probing is a follow-up.
## InferenceBackend trait (inference_backend.rs)
Runtime-switchable dispatch design. 7 backend variants documented with two classification axes:
Axis 1: full-path QJL vs leaf-only I8 hybrid vs passthrough
Axis 2: reconstruction-grade vs signature-grade vs hybrid
Trait: encode → EncodedState, score(i,j), reconstruct(i), grade().
Not yet wired into lib.rs (needs feature gate design for heavy deps).
https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
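For the P7 metric, a within-family NN recall could be computed roughly as below: for each row, find its exact nearest neighbor and count a hit when that neighbor carries the same family label. The definition, the squared-Euclidean distance, and the function names are assumptions; the actual polarquant_hip_probe may define recall differently. `normalize_rows` shows the magnitude-stripping step that the probe found harmful.

```rust
/// Squared Euclidean distance between two rows of equal length.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Fraction of rows whose exact nearest neighbor (excluding the row itself)
/// has the same family label. Ties go to the lowest row index.
pub fn within_family_nn_recall(rows: &[Vec<f32>], family: &[usize]) -> f64 {
    assert_eq!(rows.len(), family.len());
    let n = rows.len();
    if n < 2 {
        return 0.0;
    }
    let mut hits = 0usize;
    for i in 0..n {
        let mut best: Option<(f32, usize)> = None; // (distance, index)
        for j in 0..n {
            if i == j {
                continue;
            }
            let d = dist2(&rows[i], &rows[j]);
            if best.map_or(true, |(bd, _)| d < bd) {
                best = Some((d, j));
            }
        }
        if let Some((_, j)) = best {
            if family[j] == family[i] {
                hits += 1;
            }
        }
    }
    hits as f64 / n as f64
}

/// L2-normalize each row: keeps direction, discards magnitude.
/// Zero rows are returned unchanged.
pub fn normalize_rows(rows: &[Vec<f32>]) -> Vec<Vec<f32>> {
    rows.iter()
        .map(|r| {
            let norm = r.iter().map(|x| x * x).sum::<f32>().sqrt();
            if norm == 0.0 {
                r.clone()
            } else {
                r.iter().map(|x| x / norm).collect()
            }
        })
        .collect()
}
```

On a made-up toy set where rows share a direction and families differ only by magnitude, recall on the raw rows is perfect while recall on `normalize_rows` output drops, which is the qualitative effect the probe reports. The toy data is illustrative only, not a measurement.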
## I9: BF17 shapeshifting
Same 16-17 bit wire width carries different constructs at different HHTL levels: BF17 float at HEEL (joint direction+magnitude opinion), 4-bit partition at HIP, 8×i8 PolarQuant coefficients at LEAF. The "shapeshifting" is: exponent bits at HEEL become direction bits at LEAF; mantissa bits at HEEL become magnitude bits at LEAF. Explains WHY PolarQuant-only splitting hurts (P7 result): the coupling between direction and magnitude IS the information at HEEL/HIP level.
## P8: Cronbach's α codec bench (psychometric measurement model)
Reframes the R&D bench from "horse race" to "psychometric instrument validation." Codec candidates are test items; we measure internal consistency (α) to discover factor structure.
### Epiphany × population correlation matrix
Cross-tabulates every invariant (I1-I9) and probe finding (P1-P7) against 6 data populations: attention k_proj, MLP gate, vocab embedding, Jina v5 output, audio codec embeddings, BGE-M3 output. Each cell predicts what should happen if the invariant holds on that population. The bench FILLS the cells.
### Populations chosen for cross-validation
Different distribution signatures (near-orthogonal vs unit-normalized vs vocab-sparse vs SiLU-gated vs discrete-latent) ensure the factor structure is real, not artifact of one tensor's shape.
### Metrics
9 metrics per (codec × population) cell. 4 already in bgz_tensor::quality (pearson, spearman, top_k_recall, mae/rmse). 4 NEW to implement (Cronbach's α, Cohen's κ, bias, ICC).
https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
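Cronbach's α for k items is α = k/(k-1) · (1 − Σ var(item_i) / var(total)), where total is the per-subject sum over items. A sketch of how `cronbach_alpha` might look, treating each codec as an item and each scored row pair as a subject. The signature and the `Option` return are assumptions, since the function is not yet implemented in bgz_tensor::quality.

```rust
/// Unbiased sample variance; None for fewer than two values.
fn sample_var(xs: &[f64]) -> Option<f64> {
    let n = xs.len();
    if n < 2 {
        return None;
    }
    let mean = xs.iter().sum::<f64>() / n as f64;
    let ss = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>();
    Some(ss / (n - 1) as f64)
}

/// Cronbach's alpha. `items[i][s]` is the score of item i (e.g. one codec)
/// on subject s (e.g. one row pair). Returns None when alpha is undefined:
/// fewer than 2 items, fewer than 2 subjects, ragged input, or zero
/// variance of the per-subject totals.
pub fn cronbach_alpha(items: &[Vec<f64>]) -> Option<f64> {
    let k = items.len();
    if k < 2 {
        return None;
    }
    let n = items[0].len();
    if items.iter().any(|it| it.len() != n) {
        return None;
    }
    // Sum of the individual item variances.
    let mut sum_item_var = 0.0;
    for it in items {
        sum_item_var += sample_var(it)?;
    }
    // Variance of the per-subject total score.
    let totals: Vec<f64> = (0..n)
        .map(|s| items.iter().map(|it| it[s]).sum::<f64>())
        .collect();
    let total_var = sample_var(&totals)?;
    if total_var == 0.0 {
        return None;
    }
    let k = k as f64;
    Some(k / (k - 1.0) * (1.0 - sum_item_var / total_var))
}
```

Two items that agree perfectly give α = 1, and two uncorrelated items give α = 0, which makes for quick sanity tests before wiring the function into the bench.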
Summary
R&D framework for codec psychometric benchmarking. Upstream sync, probe results, InferenceBackend trait, agent tooling, measurement model.
What's on this branch (9 commits)
Upstream sync
- AdaWorldAPI-lance-graph-d9df43b/ (182 files, 3 MB). Full audit: zero content loss, our src is a strict superset. Eliminates GitHub path confusion.
- spark_dialect.rs from upstream PR #150 ("DeepNSM: COCA 5K vocabulary + 16Kbit fingerprint (47 tests)"): the ONE file upstream has that we didn't (107 LOC Spark SQL dialect + 293 LOC test).
Python reference headers
- scripts/tts_inference.py and scripts/bake_hhtld_codebooks.sh now have "REFERENCE ONLY — Rust is canonical" headers pointing to the Rust equivalents.
InferenceBackend trait (crates/thinking-engine/src/inference_backend.rs)
Probe results measured on real Qwen3-TTS-0.6B
ADK behavior monitor agent
- .claude/agents/adk-behavior-monitor.md: 7 anti-patterns (AP1-AP7) codified from PRs #176 ("perf(tts_rvq_e2e): AVX-512 F32x16 FMA + AMX polyfill probe; recover AudioNode bridge") through #188 ("chore: remove stale upstream snapshot + port spark_dialect from upstream #150"). Flags session déjà-vu.
All agents → Opus 4.7
- model: opus. Zero sonnet.
Invariants doc extended (470 LOC)
New invariants:
New probes specified:
Design principle
Nothing retired. Every research path coexists as an InferenceBackend variant. The bench runs all against all, Cronbach's α tells us factor structure, and deprecation is data-driven. Python is prep-only (HF download, ONNX export); Rust is the canonical inference runtime.
Test plan
- cargo build --release --example polarquant_hip_probe (clean)
- cargo build --release --example turboquant_correction_probe (clean)
Next session entry point
docs/CODEC_INVARIANTS_AND_EXPERIMENTS.md § P8 has the full measurement model: 7 codecs × 6 populations × 9 metrics × 6 resolution variants. The epiphany × population correlation matrix maps every invariant (I1-I9) to its testable prediction per population. Start by implementing cronbach_alpha in bgz_tensor::quality, then the bench fills the matrix.
https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj